feat(evaluators): add built-in budget evaluator for per-agent cost tracking by amabito · Pull Request #144 · agentcontrol/agent-control

amabito · 2026-03-21T02:09:46Z

Summary

New built-in evaluator "budget" that tracks cumulative token/cost usage
per agent, channel, user. Configurable time windows (daily/weekly/monthly).
Addresses feat(evaluators): add built-in budget evaluator for per-agent cost tracking #130.

Scope

User-facing/API changes:
- "budget" evaluator registered alongside regex/list/json/sql
- BudgetStore protocol + InMemoryBudgetStore (dict + threading.Lock)
- BudgetEvaluatorConfig: limits list, optional pricing table, path configs
Internal changes:
- evaluators/builtin/src/agent_control_evaluators/budget/ -- 4 files, ~650 LOC
- evaluators/builtin/tests/budget/test_budget.py -- 63 tests
Out of scope:
- No Postgres store, no DB tables, no new dependencies

Risk and Rollout

Risk level: low -- new evaluator, no changes to existing code. 230 existing tests untouched.
Rollback plan: revert PR

Testing

Added or updated automated tests (63 tests incl. thread safety, NaN/Inf, scope injection, double-count)
Ran pytest tests/ -- 293 passed
Manually verified behavior

Checklist

Linked issue/spec -- feat(evaluators): add built-in budget evaluator for per-agent cost tracking #130
Updated docs/examples -- follow-up: config example in docs/evaluators/
Included follow-up tasks -- Postgres BudgetStore (separate package)

lan17

Thank you for working on this! It looks overall in the right direction to me, but I think we can improve the design a bit to make it more universal.

Could you also implement this as a contrib evaluator so it doesn't become a built-in evaluator right away? We will then put it to use and once its in a production ready stage we can move to builtin.

lan17 · 2026-03-24T05:00:50Z

evaluators/builtin/src/agent_control_evaluators/budget/config.py

+    scope: dict[str, str] = Field(default_factory=dict)
+    per: str | None = None
+    window: Literal["daily", "weekly", "monthly"] | None = None
+    limit_usd: float | None = None


It seems like this is a wrong abstraction for budgeting. What if the budget is in different currency? Why do we need separate fields for tokens vs usd?

It might be better to do something like an integer for a limit and then define "currency" Enum which could be USD, tokens, Euros, etc.,

I don't think there's a use case for having floating point for USD or Euros, no?

Good point. I'll switch to an integer limit + a Currency enum (USD, EUR, tokens, etc.). Float was unnecessary — cents-level precision is handled by using integer cents anyway.

lan17 · 2026-03-24T05:01:39Z

evaluators/builtin/src/agent_control_evaluators/budget/config.py

+
+    scope: dict[str, str] = Field(default_factory=dict)
+    per: str | None = None
+    window: Literal["daily", "weekly", "monthly"] | None = None


What if we want hourly or half-an-hour?

Maybe its better to define "window" as an integer in seconds, or minutes? That way you can express whatever window you want.

Agreed. Will change to window_seconds: int. I'll keep a few named constants (DAILY = 86400 etc.) as convenience but the field itself will be raw seconds.

lan17 · 2026-03-24T05:03:12Z

evaluators/builtin/src/agent_control_evaluators/budget/config.py

+    on the first breach.
+
+    Attributes:
+        scope: Static scope dimensions that must match for this rule


Worth it to give some examples here for scope?

Sure, will add. Something like {"agent": "summarizer", "channel": "slack"}.

lan17 · 2026-03-24T05:05:37Z

evaluators/builtin/src/agent_control_evaluators/budget/config.py

+    limits: list[BudgetLimitRule] = Field(min_length=1)
+    pricing: dict[str, dict[str, float]] | None = None
+    token_path: str | None = None
+    cost_path: str | None = None


Shouldn't this evaluator be computing cost in USD based on model and token count?

Doesn't seem like an LLM step should be passing it down here. I maybe wrong on this.

My thinking was: the caller already has cost from the LLM response, so passing it avoids maintaining a pricing table. But I see the argument — if the evaluator owns cost computation, the contract is simpler and the caller can't lie about cost. One question: should the evaluator maintain its own pricing table, or pull from an external source (e.g. LiteLLM's model cost map)?

I believe you already include a pricing table in the evaluator config, so for version 1 of this evaluator we can just rely on that?

Makes sense. I'll have the evaluator compute cost from the config pricing table + model + token counts. That lets me drop cost_path entirely — just need token_path and a new model_path to extract the model name from the input. If the model isn't in the pricing table, fail closed.

lan17 · 2026-03-24T05:07:45Z

evaluators/builtin/src/agent_control_evaluators/budget/store.py

+    return max(ratios) if ratios else 0.0
+
+
+class InMemoryBudgetStore:


nit: This should go into a separate file so that interface for store and its implementation are separate.

Will split into store.py (protocol) and memory_store.py (InMemoryBudgetStore).

lan17 · 2026-03-24T05:11:04Z

evaluators/builtin/src/agent_control_evaluators/budget/store.py

+        """Atomically record usage and return current budget state.
+
+        Args:
+            scope_key: Composite scope identifier (e.g. "channel=slack|user_id=u1").


this doc should go into interface docs instead of here

Yep, will move to the protocol definition.

lan17 · 2026-03-24T05:11:56Z

evaluators/builtin/src/agent_control_evaluators/budget/store.py

+    def record_and_check(
+        self,
+        scope_key: str,
+        period_key: str,


Why is this needed to be passed in? Shouldn't implementation be figuring this out on its own based on current time?

Fair. The store should derive the period from window_seconds + current time internally. I was passing it in for testability but I can inject a clock instead.

lan17 · 2026-03-24T05:13:14Z

evaluators/builtin/src/agent_control_evaluators/budget/store.py

+        input_tokens: int,
+        output_tokens: int,
+        cost_usd: float,
+        limit_usd: float | None = None,


Why do we need to pass in limit here? Can't we instantiate the store with the BudgetEvaluatorConfig or something similar so it knows what limits are already?

If we want to share the store between many different kinds of limits and keys, can't we just pass in BudgetLimitRule here instead?

Makes sense. I'll have the store accept BudgetLimitRule at registration time so it knows its own limits. record_and_check just takes usage data.

lan17 · 2026-03-24T05:14:58Z

evaluators/builtin/tests/budget/test_budget.py

+# ---------------------------------------------------------------------------
+
+
+class TestInMemoryBudgetStore:


Tests should follow Given/When/Then behavioral comment style.

Will restructure. Quick example to confirm this is what you mean:

def test_single_record_under_limit(self): # Given: store with a $10 daily limit # When: record $3 of usage # Then: not breached, ratio ~0.3

lan17 · 2026-03-24T05:17:51Z

evaluators/builtin/src/agent_control_evaluators/budget/config.py

+    Attributes:
+        scope: Static scope dimensions that must match for this rule
+            to apply. Empty dict = global rule.
+        per: If set, the limit is applied independently for each unique


I don't quite understand this. Can't we just handle separate budgets by having multiple rules with different scope dicts?

For static scopes, agreed — multiple rules work. My concern is the dynamic case: "each user gets $10/day" where users aren't known at config time. With per, one rule covers all future users. Without it, you'd need to generate rules on the fly. Would a group_by field work? e.g. group_by: "user_id" means "apply this limit independently per distinct user_id value." Open to other approaches if you have something in mind.

amabito · 2026-03-24T06:37:23Z

@lan17 Thanks for the review. Moving to contrib — no objection, will restructure. Responded to each thread inline. Aiming for R2 within a week.

lan17 · 2026-03-30T15:40:54Z

@lan17 Thanks for the review. Moving to contrib — no objection, will restructure. Responded to each thread inline. Aiming for R2 within a week.

any updates @amabito ? Would it be ok with you if we take over this PR to get it over the finish line? This feature is going to be super useful to the whole project!

amabito · 2026-03-31T00:12:57Z

@lan17 Hey — update is mostly ready, was waiting on your reply on the cost computation thread before pushing. Got that answer now so I'll push the revision shortly.

…cking New contrib evaluator "budget" that tracks cumulative token/cost usage per agent, channel, user. Configurable time windows via window_seconds. Design per reviewer feedback: - Contrib evaluator (not builtin) for production hardening - Integer limit + Currency enum (USD/EUR/tokens) - window_seconds (int) instead of named windows - group_by for dynamic per-user/per-channel budgets - Evaluator owns cost computation from pricing table - BudgetStore protocol + InMemoryBudgetStore (dict + Lock) - Store derives period keys internally, injectable clock Addresses agentcontrol#130. 55 tests (incl. thread safety, NaN/Inf, scope injection, double-count).

codecov · 2026-04-01T19:35:15Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

lan17

i think there are still a couple architectural issues to sort out before this is ready. the main ones for me are: budget state currently lives on a cached evaluator instance, per-call rounding materially overstates spend for small requests, and the currency model doesn't line up with what the evaluator actually stores. i also don't think the BudgetStore abstraction is quite a clean swap point yet.

lan17 · 2026-04-01T20:49:52Z

evaluators/contrib/budget/src/agent_control_evaluator_budget/budget/evaluator.py

+
+    def __init__(self, config: BudgetEvaluatorConfig) -> None:
+        super().__init__(config)
+        self._store = InMemoryBudgetStore(rules=config.limits)


i think this runs into a fundamental mismatch with the evaluator framework. evaluator instances are cached globally by (name, config) and reused across requests, so putting InMemoryBudgetStore on self means different controls with the same config will share budget state. that leaks usage across controls/selectors in a way that's going to be very hard to reason about. i don't think the fix is to bake control_id into config, but i do think this needs some unique engine-provided evaluation key / control-instance identity so the store can namespace state outside the cached evaluator instance.

lan17 · 2026-04-01T20:49:52Z

evaluators/contrib/budget/src/agent_control_evaluator_budget/budget/evaluator.py

+    cost = (input_tokens * input_rate + output_tokens * output_rate) / 1000.0
+    if not math.isfinite(cost) or cost < 0:
+        return 0
+    return math.ceil(cost)


using ceil here means every non-zero call costs at least 1 minor unit. for example, 1 token at 30 cents / 1k becomes 1 whole cent instead of 0.03, so a stream of small calls will burn through budget way too fast. i think we need higher-precision accumulation here and should only round for display / external reporting, not on every call.

lan17 · 2026-04-01T20:49:52Z

evaluators/contrib/budget/src/agent_control_evaluator_budget/budget/evaluator.py

+            if val is not None:
+                model = str(val)
+
+        cost = _estimate_cost(model, input_tokens, output_tokens, self.config.pricing)


i don't think the currency story is coherent yet. we compute a single cost value from the pricing table and then feed that same integer into every matching rule, regardless of whether the rule is USD, EUR, or TOKENS. so EUR rules are effectively checked against USD-denominated spend, and TOKENS doesn't really map to this cost path at all. i'd either constrain this to USD for now or add explicit per-unit accounting / conversion before shipping it.

lan17 · 2026-04-01T20:49:52Z

evaluators/contrib/budget/src/agent_control_evaluator_budget/budget/store.py

+
+
+@runtime_checkable
+class BudgetStore(Protocol):


i like adding a BudgetStore protocol, but right now it doesn't feel like a clean abstraction boundary yet. important behavior assumptions like constructor-time rules, clock source, and internal period derivation live outside the protocol, and the evaluator still directly instantiates InMemoryBudgetStore. so in practice a redis/postgres backend still requires changing evaluator construction, not just swapping implementations. i'd either make store injection/provider selection part of the design now or keep this concrete until that boundary is clearer.

lan17 reviewed Mar 24, 2026

View reviewed changes

amabito force-pushed the feat/budget-evaluator branch from fa414d7 to a1345b9 Compare March 31, 2026 00:29

lan17 requested changes Apr 1, 2026

View reviewed changes

		return max(ratios) if ratios else 0.0


		class InMemoryBudgetStore:

		# ---------------------------------------------------------------------------


		class TestInMemoryBudgetStore:

Conversation

amabito commented Mar 21, 2026

Summary

Scope

Risk and Rollout

Testing

Checklist

Uh oh!

lan17 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

amabito commented Mar 24, 2026

Uh oh!

lan17 commented Mar 30, 2026

Uh oh!

amabito commented Mar 31, 2026

Uh oh!

codecov bot commented Apr 1, 2026

Codecov Report

Uh oh!

lan17 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone